-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-22672][SQL][TEST] Refactor ORC Tests #19882
Conversation
Hi, @cloud-fan , @gatorsmile , @HyukjinKwon , @viirya . |
Test build #84443 has finished for PR 19882 at commit
|
Whoa big class list. Will take a look soon within tomorrow as well. |
Thank you so much, @HyukjinKwon ! |
emptyDF.createOrReplaceTempView("empty") | ||
|
||
// This creates 1 empty ORC file with Hive ORC SerDe. We are using this trick because | ||
// Spark SQL ORC data source always avoids write empty ORC files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still using Hive ORC SerDe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for review, @viirya . I'll update tomorrow.
Instead of having |
It has more lines, doesn't it? In any way, we need helper functions. |
ok maybe have a It can avoid changing the test code from |
Okay. No problem. Thanks, @cloud-fan . |
/** | ||
* A test suite that tests Apache ORC filter API based filter pushdown optimization. | ||
*/ | ||
class OrcFilterSuite extends OrcTest with SharedSQLContext { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let HiveOrcFilterSuite
extend OrcFilterSuite
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ur, it's impossible because of the reason I mentioned in PR description.
OrcFilterSuite and HiveOrcFilterSuite cannot reuse test cases due to the different function signatures using Hive 1.2.1 ORC classes and Apache ORC 1.4.1 classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we leave some comments to explain that reason?
Seems there are many duplications and I would wonder why.
Test build #84524 has finished for PR 19882 at commit
|
read | ||
.option(ConfVars.DEFAULTPARTITIONNAME.varname, defaultPartitionName) | ||
spark.read | ||
.option("hive.exec.default.partition.name", defaultPartitionName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the new ORC didn't change these config names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. In fact, Apache ORC doesn't have this params.
import testImplicits._ | ||
|
||
def orcImp: String = "native" | ||
|
||
var originalConfORCImplementation = "native" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private var
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep.
|
||
override def orcImp: String = "hive" | ||
|
||
test("SPARK-8501: Avoids discovery schema from empty ORC files") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does this test not in native orc test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Native ORC solve this bug. Native ORC have a corresponding test case here.
+ test("Schema discovery on empty ORC files") {
+ // SPARK-8501 is fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then why don't we put this test in the base class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only works in new OrcFileFormat.
- The new test case is in
OrcQuerySuite
for new OrcFileFormat. - And, the old test case is
HiveOrcQuerySuite
for old OrcFileFormat.
|
||
spark.sql( | ||
s"""CREATE TEMPORARY VIEW normal_orc_source | ||
|USING org.apache.spark.sql.hive.orc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be using orc
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. It just comes from the original test case.
} | ||
} | ||
|
||
test("SPARK-21791 ORC should support column names with dot") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we need this test case for Hive's one too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Old OrcFileFormat fails on this test case.
Do you mean adding an exception-catching test case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I overlooked. Sure, that's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
override def beforeAll(): Unit = { | ||
class OrcSourceSuite extends OrcSuite with SharedSQLContext { | ||
|
||
protected override def beforeAll(): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mind if I ask where this is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test cases of OrcSuite assume these tables.
import testImplicits._ | ||
|
||
def orcImp: String = "native" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe val
could be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. Done.
It's rebased to the master to resolve conflicts. Also, I addressed the comments. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Loosely related though, should we maybe rename org.apache.spark.sql.hive.orc.Orc*
-> org.apache.spark.sql.hive.orc.HiveOrc*
in the main codes too to distinguish the newer ORC from the old Hive ORC?
/** | ||
* A test suite that tests Apache ORC filter API based filter pushdown optimization. | ||
*/ | ||
class OrcFilterSuite extends OrcTest with SharedSQLContext { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we leave some comments to explain that reason?
Seems there are many duplications and I would wonder why.
|
||
test("filter pushdown - combinations with logical operators") { | ||
withOrcDataFrame((1 to 4).map(i => Tuple1(Option(i)))) { implicit df => | ||
// Because `ExpressionTree` is not accessible at Hive 1.2.x, this should be checked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote the original tests here like this using toString
partly because ExpressionTree
(SearchArgument.getExpression
) it's inaccessible and the string format is easy to read.
Although I think that tree seems available in the ORC, I think it's okay to keep the tests like this. It's easy to read but let's fix up the comments here. It doesn't look about Hive anyway here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I'll remove the comment. For the test case, I agree with you. And also this string tests will be consistent with Hive Orc tests for a while.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup.
LGTM BTW. |
Test build #84581 has finished for PR 19882 at commit
|
Thank you so much, @HyukjinKwon ! |
@HyukjinKwon . The PR code and description is updated like the followings.
The main reason I used prefix As you see in the PR description, we need to use |
I actually suggested similarly before:
but I remember I received an advise at that time and it became as below:
After rethinking it, I realised this is better. Likewise, I actually liked |
I meant in #19882 (review), I liked this |
Oh. I'll bring back. |
This reverts commit 1828571.
I am sorry, I had to clarify this ahead .. |
Definitely, my bad. For the main code, we can do later in a separate PR if needed. I hope this PR contains tests only. |
Test build #84589 has finished for PR 19882 at commit
|
Test build #84591 has finished for PR 19882 at commit
|
retest this please |
Test build #84599 has finished for PR 19882 at commit
|
thanks, merging to master! |
Thank you so much, @cloud-fan , @HyukjinKwon , and @gatorsmile ! |
## What changes were proposed in this pull request? During #19882, `conf` is mistakenly used to switch ORC implementation between `native` and `hive`. To affect `OrcTest` correctly, `spark.conf` should be used. ## How was this patch tested? Pass the tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19931 from dongjoon-hyun/SPARK-22672-2.
What changes were proposed in this pull request?
Since SPARK-20682, we have two
OrcFileFormat
s. This PR refactors ORC tests with three principles (with a few exceptions)sql/core
.HiveXXX
test suite insql/hive
by reusingsql/core
test suite.OrcTest
will provide common helper functions andval orcImp: String
.Test Suites
Native OrcFileFormat
Hive built-in OrcFileFormat
Hierarchy
Please note the followings.
OrcHadoopFsRelationSuite
doesn't inheritOrcTest
. It is insidesql/hive
likeParquetHadoopFsRelationSuite
due to the dependencies and follows the existing convention to useval dataSourceName: String
OrcFilterSuite
s cannot reuse test cases due to the different function signatures using Hive 1.2.1 ORC classes and Apache ORC 1.4.1 classes.How was this patch tested?
Pass the Jenkins tests with reorganized test suites.